In [13]:
import lxml.etree
import csv
import os
import pandas as pd

Introduction

The purpose of this notebook is to build a field parser that extracts the contents of various fields in the CWE 3.0 XML file, so that the field content can be analyzed directly and stored in a database. The raw XML file can be downloaded at http://cwe.mitre.org/data/xml/cwec_v3.0.xml.zip. Guided by the CWE Introduction notebook, this notebook focuses on the detailed structure under the Weakness table and on how the parser functions extract two formats of field: fields with no nesting element and fields with a nesting structure.

Although the overall structure of the CWE XML file has been documented in the CWE Introduction notebook, that notebook is built on version 2.9. The following differences in the weakness table between version 2.9 and 3.0 can be observed:

  • The order of the four tables has changed, and the weakness table comes first in version 3.0.
  • Several fields are removed or renamed in version 3.0: Time_of_Introduction, Maintenance_Notes, Causal_Nature, Research_Gaps, White_Box_Definitions, Terminology_Notes, Other_Notes, Enabling_Factors_for_Exploitation, Relevant_Properties

In [3]:
tree = lxml.etree.parse('cwec_v3.0.xml')
root = tree.getroot()

# Remove namespaces from XML.
for elem in root.iter():  # getiterator() is deprecated in lxml; iter() is the replacement
    # Comments and processing instructions have a non-string tag with no find() method; skip them
    if not hasattr(elem.tag, 'find'): continue
    i = elem.tag.find('}') # Index of the '}' that closes the XML namespace prefix in the tag (-1 if absent)
    if i >= 0:
        elem.tag = elem.tag[i+1:] # Keep only the part of the tag after the '}'
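
As a rough illustration of the namespace removal above, the sketch below applies the same idea to a small hand-written document. It uses the standard-library xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; the namespace URI, the sample XML, and the helper name strip_namespaces are assumptions made for this example and are not taken from the CWE file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-weakness document using a made-up namespace URI.
SAMPLE = """<Weakness_Catalog xmlns="http://example.org/cwe">
  <Weaknesses>
    <Weakness ID="1"><Description>first</Description></Weakness>
    <Weakness ID="2"><Description>second</Description></Weakness>
  </Weaknesses>
</Weakness_Catalog>"""

def strip_namespaces(root):
    """Rewrite every '{uri}Tag' in place as 'Tag' (hypothetical helper)."""
    for elem in root.iter():
        # Nodes such as comments may carry a non-string tag; skip them.
        if not isinstance(elem.tag, str):
            continue
        # rpartition returns the whole tag as the last item when no '}' is present.
        elem.tag = elem.tag.rpartition('}')[2]
    return root

root = ET.fromstring(SAMPLE)
tags_before = [child.tag for child in root]
strip_namespaces(root)
tags_after = [child.tag for child in root]

print(tags_before, '->', tags_after)
# Plain paths work once the namespace prefix is gone.
print(len(root.findall('Weaknesses/Weakness')))
```

Without this step, every path passed to findall would have to spell out the namespace for each tag, which is why the notebook strips it once up front.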

In [4]:
for table in root: 
    print(table.tag)


Weaknesses
Categories
Views
External_References

Format and Field Parser

Although there are many kinds of fields, the CWE XML file stores field information in only three ways: 1) fields with no nesting element, 2) fields with nesting elements, 3) fields with attribute information.

Format | CWE Field Example
Fields with no nesting element | Description, Extended_Description, Likelihood_Of_Exploit, Background_Details
Fields with nesting elements | Potential_Mitigations, Weakness_Ordinalities, Common_Consequences, Alternate_Terms, Modes_Of_Introduction, Affected_Resources, Observed_Examples, Functional_Areas, Content_History, Detection_Methods
Fields with attribute information | Demonstrative_Examples, Taxonomy_Mappings, Applicable_Platforms, References, Related_Attack_Patterns

Below, we discuss the detailed structure of the first two types of field and how to parse them.

1.1 Fields with no nesting element

Typically, fields in this format keep all of their information directly under the field element, without any nesting structure or attributes. For example, Description and Extended_Description are fields in this format. There is no further nesting structure under the field element, so the element cannot be expanded (no plus sign on the left).

However, when parsing Extended_Description in cwe-1007, there are nested HTML elements under the Extended_Description element. In this case, we remove the HTML tags and concatenate the contents of the separate HTML elements.

General case:

HTML elements under Extended_Description:
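
The two cases can be reproduced on a small hand-written fragment. The sketch below is only an illustration: the sample XML and the helper name field_text are assumptions, and it uses the standard-library xml.etree.ElementTree rather than lxml. It reads the text directly under a field and falls back to the text of nested HTML tags when the field itself holds only whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one plain-text field and one field wrapped in html tags.
SAMPLE = """<Weakness ID="0">
  <Description>Plain text stored directly under the field.</Description>
  <Extended_Description>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </Extended_Description>
</Weakness>"""

def field_text(field):
    """Return the field's own text, or the joined text of its nested tags."""
    direct = (field.text or '').strip()
    if direct:
        return direct
    # itertext() walks the subtree in document order and yields every text chunk,
    # including whitespace-only chunks between tags, which are filtered out here.
    return ' '.join(chunk.strip() for chunk in field.itertext() if chunk.strip())

weakness = ET.fromstring(SAMPLE)
plain = field_text(weakness.find('Description'))
nested = field_text(weakness.find('Extended_Description'))

print(plain)
print(nested)
```

Joining with a single space is a choice made for this sketch; any separator that keeps the paragraphs distinguishable would serve.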

1.2 Parser function for field with no nesting element

Before introducing the parser function, we need a function that can write the dictionary that stores the field content to a CSV file. Function write_dict_to_csv will append the given dictionary to the end of the CSV file. If the file does not exist, the function will create a CSV file and take the csv_header as the header of this CSV file.


In [5]:
def write_dict_to_csv(output_file,csv_header,dict_data):
    '''
    Create a CSV file with headers and write a dictionary;
    if the file already exists, only append the dictionary.
    
    Args:
        output_file -- name of the output csv file
        csv_header -- the header of the output csv file. 
        dict_data -- the dictionary that will be written into the CSV file. The number of 
                     elements in the dictionary should be equal to or lower than the number of
                     headers of the CSV file. 
    
    Outcome:
        a new csv file with headers and one row that includes the information from the dictionary;
        or an existing CSV file with a new row that includes the information from the dictionary
    '''
    # create the file if it does not exist; if it exists, open it for appending
    # (newline='' lets the csv module control line endings itself)
    with open(output_file, 'a', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=csv_header,lineterminator='\n')
        
        # if the csv file is empty, write the header first
        if csv_file.tell()==0:
            writer.writeheader()
        # in both cases, write the dictionary as a new row
        writer.writerow(dict_data)
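
The append-or-create behaviour described above can be checked on a temporary file. The following is a minimal sketch, not the notebook's own test: append_row is a hypothetical stand-in that mirrors the helper, and the header and rows are invented. It also shows what csv.DictWriter does by default when a dictionary has fewer or more keys than the header.

```python
import csv
import os
import tempfile

def append_row(path, header, row):
    """Append one row; write the header first if the file is still empty."""
    with open(path, 'a', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=header, lineterminator='\n')
        if handle.tell() == 0:
            writer.writeheader()
        writer.writerow(row)

header = ['cwe_id', 'field', 'note']
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

append_row(path, header, {'cwe_id': '1', 'field': 'Demo', 'note': 'complete row'})
append_row(path, header, {'cwe_id': '2', 'field': 'Demo'})  # 'note' is missing

with open(path, newline='') as handle:
    rows = list(csv.DictReader(handle))

# A key that is not in the header is rejected by DictWriter's default settings.
try:
    append_row(path, header, {'cwe_id': '3', 'unknown': 'x'})
    extra_key_rejected = False
except ValueError:
    extra_key_rejected = True

print(rows)
print(extra_key_rejected)
```

This is why the docstring asks for a dictionary with no more elements than the header: missing keys become empty cells, while unexpected keys raise an error.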

Given the target field, function no_nesting_field_parser will extract the contents within the target field element and write cwe_id and content into a CSV file named after the target field. Each row in the output CSV file will include the following information:

  • cwe_id: The CWE identifier
  • field: The name of the target field
  • (field name)_content: The text information stored under the target field. The header of this column varies depending on the field. For example, the header will be 'description_content' if parsing 'Description' field

The following fields have been tested successfully: Description, Extended_Description, Likelihood_Of_Exploit, Background_Details.


In [10]:
def no_nesting_field_parser(target_field, root):
    '''
    Parse the field with no nesting element from cwec_v3.0.xml file and output the information to a csv file.
    
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the whole parsed tree. 
    Outcome:
        a csv file named by the field name. Each row will include the following information:
            - cwe_id: The CWE identifier
            - field: The name of the target field
            - (field name)_content:  The text information stored under the target field. The header varies depending on field. 
                                     For example, the header will be 'description_content' if parsing 'Description' field
    '''
    # define the path of target field. Here we select all element nodes that the tag is the target field
    target_field_path='Weakness/./'+target_field
    # extract weakness table in the XML
    weakness_table = root[0]
    #define the headers
    field_header=target_field.lower()+'_content' 
    output_header=['cwe_id','field',field_header] 
    #define path of the output file
    output_path=target_field+'.csv'

    # for each target field node
    for field in weakness_table.findall(target_field_path):
        # extract cwe_id from the parent node of the target field node
        cwe_id=field.getparent().attrib.get('ID')
        # extract the text directly under the target field (None becomes an empty string)
        field_entry_content=field.text or ''
        
        # in case the text is wrapped in nested html tags under the field
        if not field_entry_content.strip():
            # extract the text under every nested html tag and concatenate it
            field_entry_content=' '.join(text.strip() for text in field.itertext() if text.strip())
        
        # build the dictionary that is used to write
        field_entry_dict=dict()
        field_entry_dict['cwe_id']=cwe_id
        field_entry_dict['field']=target_field
        field_entry_dict[field_header.lower()]= field_entry_content.strip()
        
        # write the dictionary with headers to a CSV file
        write_dict_to_csv(output_path,output_header, field_entry_dict)

In [12]:
des='Description' 
extended_des='Extended_Description'
likelihood='Likelihood_Of_Exploit'
background='Background_Details'
no_nesting_field_parser(des,root)

After running the code above, a file named 'Description.csv' should be created in the same directory as this notebook. To parse other fields, change the name of the target field passed to the function.


In [16]:
no_nesting_field=pd.read_csv('Description.csv')
no_nesting_field.head(5)


Out[16]:
   cwe_id  field        description_content
0    1004  Description  The software uses a cookie to store sensitive ...
1    1007  Description  The software displays information or identifie...
2     102  Description  The application uses multiple validation forms...
3    1021  Description  The web application does not restrict or incor...
4    1022  Description  The web application does not restrict or incor...

2.1 Fields with nesting elements

Typically, fields in this format have a nested structure under the target field element. To understand the nesting structure, we use the Common_Consequences field in cwe-1004 as an example. Under the Common_Consequences element, there are two field entries named 'Consequence', which represent two different individual consequences associated with the weakness. Under each Consequence element, there are three entry elements (Scope, Impact, and Note), which hold the content that our parser is intended to extract.

General case:

To clarify the structure and the variable naming used in the code, the structure of fields in this format is generalized below:

<Target_Field>
    <Field_Entry1>
        <Entry_Element1> the content function will parse</Entry_Element1>
        <Entry_Element2> the content function will parse</Entry_Element2>
        <Entry_Element3> the content function will parse</Entry_Element3>
        <Entry_Element4> the content function will parse</Entry_Element4>
        ...
    </Field_Entry1>
    <Field_Entry2>
        <Entry_Element1> the content function will parse</Entry_Element1>
        <Entry_Element2> the content function will parse</Entry_Element2>
        <Entry_Element3> the content function will parse</Entry_Element3>
        <Entry_Element4> the content function will parse</Entry_Element4>
        ...
    </Field_Entry2>
    ...
</Target_Field>

Here are two special cases when parsing the nesting fields.

1) Multiple entry elements may share the same tag:

For example, a consequence of a weakness may have only one impact and note but multiple scopes. In this case, the parser extracts and concatenates the contents that share the same tag under an individual field entry element.

2) HTML elements under entry element:

In some cases, the content we aim to extract is stored in HTML elements such as li, div, ul, and p. In this case, the parser extracts and concatenates the content of the HTML elements under the same entry element. Because HTML tags are not used as column headers, the parser then takes the tag from the parent element instead.
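
Both special cases can be sketched on a hand-written Common_Consequences fragment. The example below is an illustration under stated assumptions: the sample XML and the helper name flatten_entry are invented, and it uses the standard-library xml.etree.ElementTree. Instead of walking up from an HTML element to its parent, it visits only the direct children of a field entry and gathers all text beneath each one, which gives the same grouping for this simple shape.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two Scope elements and a Note whose text sits in <p> tags.
SAMPLE = """<Common_Consequences>
  <Consequence>
    <Scope>Integrity</Scope>
    <Scope>Confidentiality</Scope>
    <Impact>Other</Impact>
    <Note><p>First point.</p><p>Second point.</p></Note>
  </Consequence>
</Common_Consequences>"""

def flatten_entry(field_entry):
    """Map each entry-element tag to its text, joining repeated tags with ';'."""
    row = {}
    for entry_element in field_entry:  # direct children of the field entry only
        # itertext() also collects text held by nested html tags such as <p>.
        chunks = [c.strip() for c in entry_element.itertext() if c.strip()]
        if not chunks:
            continue
        key = entry_element.tag.lower()
        text = ';'.join(chunks)
        # Concatenate when the same tag was already seen in this entry.
        row[key] = row[key] + ';' + text if key in row else text
    return row

field = ET.fromstring(SAMPLE)
rows = [flatten_entry(entry) for entry in field]
print(rows)
```

The ';' separator matches the one used by the parser in this notebook, so repeated scopes appear in the form 'Integrity;Confidentiality'.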

2.2 Parser Function for fields with nesting elements

Given the target field, function nesting_field_parser will extract the content within the target field element and write cwe_id and content into a CSV file named after the target field. Each row in the output CSV file will include the following information:

  • cwe_id: The CWE identifier
  • field: The name of the target field
  • tags under the field node, excluding all HTML tags such as li, div, ul, and p.

There are two parts within function nesting_field_parser. The first part generates all possible tags as the headers of the output CSV file by traversing all child element tags under each field entry. This part matters because once the function has written the header row, editing it later is expensive: we would have to read all content of the original file and rewrite it to a new file. The function excludes all HTML tags, such as li, div, ul, and p, because these tags are repetitive and carry no field meaning. The second part extracts the content from the nesting target field and then writes it to a CSV file by using function write_dict_to_csv.

The following fields have been tested successfully: Potential_Mitigations, Weakness_Ordinalities, Common_Consequences, Alternate_Terms, Modes_Of_Introduction, Affected_Resources, Observed_Examples, Functional_Areas, Content_History, Detection_Methods.
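
The two-part design can be sketched independently of the XML. In the example below the rows are invented and stand in for flattened field entries; the first pass fixes the header before anything is written, and the second pass writes every row against that final header. It writes to an in-memory buffer rather than a file, which is a simplification for this sketch.

```python
import csv
import io

# Hypothetical rows, as produced by flattening field entries; keys differ per row.
rows = [
    {'cwe_id': '1', 'field': 'Demo', 'scope': 'Integrity', 'impact': 'Other'},
    {'cwe_id': '2', 'field': 'Demo', 'scope': 'Availability', 'note': 'Only here.'},
]

# Pass 1: collect every key in first-seen order, so the header is final before writing.
header = ['cwe_id', 'field']
for row in rows:
    for key in row:
        if key not in header:
            header.append(key)

# Pass 2: write all rows; keys missing from a row become empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
output = buffer.getvalue()

print(header)
print(output)
```

Had the header been taken from the first row alone, the 'note' column of the second row would have had no place in the file, which is the situation the first pass avoids.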


In [26]:
def nesting_field_parser(target_field, root):
    '''
    Parse the field with nested elements from cwec_v3.0.xml file and output the information to a csv file.
    The following fields have been tested successfully: 
        -Potential_Mitigations, Weakness_Ordinalities
        -Common_Consequences, Alternate_Terms
        -Modes_Of_Introduction, Affected_Resources
        -Observed_Examples, Functional_Areas
        -Content_History, Detection_Methods
    
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the parsed tree. 
    
    Outcome:
        a csv file named by the field name. Each row will include the following headers:
            - cwe_id: The CWE identifier
            - field: The name of the target field
            - tags under the field node, but exclude all html tags, including li, div, ul,and p.
    '''
    # define the path of target field. Here we select all element nodes that the tag is the target field
    target_field_path='Weakness/./'+target_field
    # extract weakness table in the XML
    weakness_table = root[0]
    # define the headers
    output_header=['cwe_id','field']
    # define path of the output file
    output_path=target_field+'.csv'

    ### 1.Generate all possible tags(column header in csv file) under the target field tree
    
    # for each target field node
    for field in weakness_table.findall(target_field_path):
        # for each field entry, in case there are multiple field entries under the target field node
        for field_entry in list(field):
            # traverse all entry_element nodes under each field entry
            for entry_element in field_entry.iter():
                # generate tag and content of each entry_element
                entry_element_tag=entry_element.tag
                entry_element_content=entry_element.text
                
                # skip nodes without text of their own, such as the field entry node itself,
                # since .iter() returns the field entry node as well as its entry_element nodes
                if entry_element_content is None or entry_element_content.isspace():
                    continue
                # exclude all html tags, such as li,div,ul,p
                if entry_element_tag in ('li','div','p','ul'):
                    continue
                # append the tag to the output_header list if it does not exist in the list
                if entry_element_tag.lower() not in output_header:
                    output_header.append(entry_element_tag.lower())
    
    ### 2.Extract the content from the nesting target field
    
    # for each target field node
    for field in weakness_table.findall(target_field_path):
        # extract cwe_id from the attribute of its parent node
        cwe_id=field.getparent().attrib.get('ID')
        # for each field entry node under the target field node
        for field_entry in list(field):
            
            # the dictionary that will be written to a CSV file
            entry_element_dict=dict()
            entry_element_dict['cwe_id']=cwe_id
            entry_element_dict['field']=target_field
            
            # traverse all entry_element nodes under each field entry
            for entry_element in field_entry.iter():
                # generate tag and content of each entry_element
                entry_element_tag=entry_element.tag
                entry_element_content=entry_element.text
                
                # skip nodes without text of their own, such as the field entry node itself
                if entry_element_content is None or entry_element_content.isspace():
                    continue
                
                #if the tag is html tag, such as li, div, p, and ul, the tag will be replaced by its parent tag
                while(entry_element_tag.lower() not in output_header):
                    entry_element_tag=entry_element.getparent().tag.lower()
                    entry_element=entry_element.getparent()

                # if there are multiple entry_element entries using the same tag, all content will be concatenated
                if entry_element_tag.lower() in entry_element_dict:
                    # add the concatenated content into the dictionary 
                    entry_element_dict[entry_element_tag.lower()]=entry_element_dict[entry_element_tag.lower()]+ ';'+entry_element_content
                # if not, directly add the entry_element content into the dictionary
                else:
                    entry_element_dict[entry_element_tag.lower()]=entry_element_content
            # write the dictionary with headers to a CSV file    
            write_dict_to_csv(output_path,output_header,entry_element_dict)

In [19]:
mitigation="Potential_Mitigations" 
consequence='Common_Consequences'
mode='Modes_Of_Introduction' 
example='Observed_Examples'
content='Content_History' 
weakness='Weakness_Ordinalities' 
detection='Detection_Methods' 
term='Alternate_Terms'
resources='Affected_Resources'
function_area='Functional_Areas'

nesting_field_parser(consequence, root)

After running the code above, a file named 'Common_Consequences.csv' should be created in the same directory as this notebook. To parse other fields, change the name of the target field passed to the function.


In [22]:
nesting_field=pd.read_csv('Common_Consequences.csv')
nesting_field.head(5)


Out[22]:
   cwe_id  field                scope                      impact                                             note
0    1004  Common_Consequences  Confidentiality            Read Application Data                              If the HttpOnly flag is not set, then sensitiv...
1    1004  Common_Consequences  Integrity                  Gain Privileges or Assume Identity                 If the cookie in question is an authentication...
2    1007  Common_Consequences  Integrity;Confidentiality  Other                                              An attacker may ultimately redirect a user to ...
3     102  Common_Consequences  Integrity                  Unexpected State                                   NaN
4    1021  Common_Consequences  Access Control             Gain Privileges or Assume Identity;Bypass Prot...  An attacker can trick a user into performing a...

2.3 Fields with attribute information

Typically, fields in this format store information not only in the element content but also in element attributes. For example, the attributes of Example_Code under the Demonstrative_Examples field store the nature and the language of the example code. If the information stored in the attributes can be ignored, function nesting_field_parser also works for fields in this format.
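
When the attribute information should be kept, it can be read from each element's attribute mapping. The sketch below is a hedged illustration and not part of the parsers above: the sample XML, the attribute names Nature and Language as written here, the 'attr_' column prefix, and the helper name element_with_attributes are all assumptions for the example, and it uses the standard-library xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; attribute names are assumed from the description above.
SAMPLE = """<Demonstrative_Examples>
  <Demonstrative_Example>
    <Intro_Text>Intro.</Intro_Text>
    <Example_Code Nature="bad" Language="Java">int x = 0;</Example_Code>
  </Demonstrative_Example>
</Demonstrative_Examples>"""

def element_with_attributes(elem):
    """Return the element's tag, text, and attributes as one flat dictionary."""
    row = {'tag': elem.tag.lower(), 'text': (elem.text or '').strip()}
    # Prefix attribute names so they cannot collide with the 'tag' and 'text' keys.
    for name, value in elem.attrib.items():
        row['attr_' + name.lower()] = value
    return row

field = ET.fromstring(SAMPLE)
rows = [element_with_attributes(e) for e in field.iter('Example_Code')]
print(rows)
```

Each attribute becomes its own column, so the rows could be written with the same header-then-rows approach used for the nesting fields.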